Abstract

We use the $\chi^2$-divergence as a measure of diversity between
probability densities and review the basic properties of the estimator
$D_2(\cdot\|\cdot)$. Next, we define a few objects that capture the relevant
information in a Markov chain sample and use them to define two
estimators, the Local Dependency Level and the Global Dependency Level
of a Markov chain sample. After exploring their properties, we propose a
new estimator for the Markov chain order. Finally, we present tables of
numerical simulation results comparing the performance of the new
estimator with the well-known and established AIC and BIC estimators.
1 Introduction

A Markov chain is a discrete stochastic process $X = \{X_n\}_{n>0}$ with state
space $E$ of cardinality $|E| < \infty$ for which there is a $k \ge 1$ such that,
for $n > k$ and $(x_1, \dots, x_n) \in E^n$,

$$P(X_1 = x_1, \dots, X_n = x_n) = P(X_1 = x_1, \dots, X_k = x_k) \prod_{j=k+1}^{n} Q(x_j \mid x_{j-k}, \dots, x_{j-1})$$

for suitable transition probabilities $Q(\cdot \mid \cdot)$. The class of processes
satisfying the above condition for a given $k \ge 1$ will be denoted by
$\mathcal{M}_k$, and $\mathcal{M}_0$ will denote the class of i.i.d. processes. The
order of a process in $\bigcup_{i=0}^{\infty} \mathcal{M}_i$ is the smallest integer
$\kappa$ such that $X = \{X_n\}_{n>0} \in \mathcal{M}_\kappa$.

Over the last few decades there has been a great deal of research on
the estimation of the order of a Markov chain, starting with M.S. Bartlett,
P.G. Hoel [16], I.J. Good [15], T.W. Anderson & L.A. Goodman, P.
Billingsley [7], [8], among others. More recently, H. Tong [23], G. Schwarz
[22], R.W. Katz [17], I. Csiszár and P. Shields [11], and L.C. Zhao et al. [23]
have contributed new Markov chain order estimators.

Since 1973, H. Akaike's [1] entropic information criterion, known as AIC,
has had a fundamental impact on statistical model evaluation problems. The
AIC has been applied by Tong, for example, to the problem of estimating the
order of autoregressive processes, autoregressive integrated moving average
processes, and Markov chains. The Akaike-Tong (AIC) estimator was derived
as an asymptotic approximate estimate of the Kullback-Leibler information
discrepancy and provides a useful tool for evaluating models estimated by
the maximum likelihood method. Later on, Katz derived the asymptotic
distribution of the estimator and showed its inconsistency, proving that there
is a positive probability of overestimating the true order no matter how large
the sample size. Nevertheless, AIC is the most widely used and successful
Markov chain order estimator at present, mainly because it is more
efficient than BIC for small samples.

The main consistent alternative, the BIC estimator, does not perform
well for relatively small samples, as pointed out by Katz [17]
and Csiszár & Shields [11]. It is natural to expect that growth in the
complexity of the Markov chain (size of the state space and order) has a
significant influence on the sample size required to identify the unknown
order, even though, most of the time, it is difficult to obtain sufficiently
large samples.

In these notes we use a different entropic object, the $\chi^2$-divergence, and
study its behaviour when applied to samples from random variables with
multinomial empirical distributions

$$\mathcal{X} = \{X_i\}_{1 \le i \le r}$$

derived from a Markov chain sample. Finally, we propose a new
strongly consistent Markov chain order estimator that is more efficient than
the established AIC and BIC, as shown by the outcomes of several
numerical simulations.

In Section 2 we succinctly review the concept of $f$-divergence and its
properties, and define the $\chi^2$-divergence estimator. In Section 3 we turn
to the Markov chain sample: we introduce some notation, briefly elaborate on
the Law of the Iterated Logarithm (LIL) for our particular situation, review
some results concerning the convergence of the estimator, and define the
Local Dependency Level and the Global Dependency Level, which are the
foundation of the consistent Markov chain order estimator defined
subsequently. Finally, in Section 4 we describe the procedures used
and the results obtained in exploratory numerical simulations.



2 Entropy and f-divergences 
2.1 Definitions and Notations 

An $f$-divergence is a function that measures the discrepancy between two
probability distributions $P$ and $Q$. Intuitively, the divergence is an average
of the function $f$ of the odds ratio given by $P$ and $Q$.

These divergences were introduced and studied independently by Csiszár,
Csiszár & Shields and Ali & Silvey, among others ([10], [12], [3]), and are
sometimes referred to as Ali-Silvey distances.

Definition 2.1. Let $P$ and $Q$ be discrete probability densities with support
$S(P) = S(Q) = E = \{1, \dots, m\}$. For a convex function $f(t)$ defined for $t > 0$
with $f(1) = 0$, the $f$-divergence for the distributions $P$ and $Q$ is defined as

$$D_f(P\|Q) = \sum_{a \in E} Q(a)\, f\!\left(\frac{P(a)}{Q(a)}\right).$$

Here we take $0 f(0/0) = 0$, $f(0) = \lim_{t \to 0} f(t)$, and
$0 f(a/0) = \lim_{t \to 0} t f(a/t) = a \lim_{u \to \infty} f(u)/u$. ♦
For example:

$$f(t) = t \log(t) \;\Rightarrow\; D_f(P\|Q) = D(P\|Q) = \sum_{a \in E} P(a) \log\frac{P(a)}{Q(a)},$$

$$f(t) = (t-1)^2 \;\Rightarrow\; D_f(P\|Q) = \sum_{a \in E} \frac{(P(a) - Q(a))^2}{Q(a)},$$

which are called relative entropy and $\chi^2$-divergence, respectively. From
now on the $\chi^2$-divergence shall be denoted by $D_2(P\|Q)$.
Observe that the triangle inequality is not satisfied in general, so that
$D_2(P\|Q)$ does not define a distance in the strict sense.




A basic theorem about $f$-divergences is the following approximation by the
$\chi^2$-divergence $D_2(P\|Q)$.

Theorem 2.1. (Csiszár & Shields) If $f$ is twice differentiable at $t = 1$ and
$f''(1) > 0$, then for any $Q$ with support $S(Q) = A$ and $P$ close to $Q$,

$$D_f(P\|Q) \approx \frac{f''(1)}{2}\, D_2(P\|Q).$$

Formally, $D_f(P\|Q) / D_2(P\|Q) \to f''(1)/2$ as $P \to Q$. ♦

The $\chi^2$-divergence $D_2(P\|Q)$ test is a well-known statistical test
procedure closely related to the chi-square distribution. See [19] for thorough
and detailed references.
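
As a numerical illustration of Theorem 2.1, the following minimal Python sketch compares the relative entropy ($f(t) = t\log t$, so $f''(1)/2 = 1/2$) with the $\chi^2$-divergence as $P$ approaches $Q$. The distribution `q`, the perturbation `direction` and the function names are illustrative assumptions, not taken from the text.

```python
import math

def kl_divergence(p, q):
    """Relative entropy D(P||Q) = sum_a P(a) log(P(a)/Q(a)), with 0*log(0) = 0."""
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0)

def chi2_divergence(p, q):
    """Chi-square divergence D_2(P||Q) = sum_a (P(a) - Q(a))^2 / Q(a)."""
    return sum((pa - qa) ** 2 / qa for pa, qa in zip(p, q))

def perturb(q, direction, eps):
    """Move Q by eps along a zero-sum direction, so the result still sums to 1."""
    return [qa + eps * da for qa, da in zip(q, direction)]

# Hypothetical example: Q on three states and a zero-sum perturbation.
q = [0.2, 0.3, 0.5]
direction = [1.0, -2.0, 1.0]

# Ratio D(P||Q) / D_2(P||Q) for P = Q + eps * direction, eps shrinking to 0.
ratios = []
for eps in (1e-1, 1e-2, 1e-3):
    p = perturb(q, direction, eps)
    ratios.append(kl_divergence(p, q) / chi2_divergence(p, q))

print(ratios)  # the ratios should approach f''(1)/2 = 0.5
```

The ratio is only close to $1/2$ for small perturbations; for $P$ far from $Q$ the two divergences can differ substantially.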



2.2 The $\Delta_2$ Estimator

Now we consider a set $\mathcal{X} = \{X_1, \dots, X_r\}$ of discrete independent
random variables and $\{P_{X_i}\}_{1 \le i \le r}$ their probability distributions
with common support $S(P_{X_i}) = E = \{1, 2, \dots, m\}$, $i = 1, \dots, r$.

Let $\mathbf{x}_i = (x_1^{(i)}, \dots, x_{n_i}^{(i)})$ be a random sample of $X_i$ of
size $n_i$, where $n = \sum_{i=1}^{r} n_i$. To test the homogeneity of the
distributions $\{P_{X_i}\}_{1 \le i \le r}$, we compare the observed and expected
empirical frequencies by means of the $\chi^2$-divergence.
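Definition 2.2. Let the observed empirical random variables be defined as

$$O_n^{\mathcal{X}}(i,j) = \left|\{k : x_k^{(i)} = j\}\right|, \quad i = 1, \dots, r, \; j \in E,$$

the expected empirical random variables be given by

$$E_n^{\mathcal{X}}(i,j) = \frac{n_i \sum_{l=1}^{r} O_n^{\mathcal{X}}(l,j)}{n}, \quad i = 1, \dots, r, \; j \in E,$$

and the respective probability functions by

$$P_{O^{\mathcal{X}}}(i,j) = \frac{O_n^{\mathcal{X}}(i,j)}{n}, \qquad P_{E^{\mathcal{X}}}(i,j) = \frac{E_n^{\mathcal{X}}(i,j)}{n}, \quad i = 1, \dots, r, \; j \in E.$$

We define the estimator

$$\Delta_2(P_{O^{\mathcal{X}}}\|P_{E^{\mathcal{X}}}) = n \sum_{i=1}^{r} \sum_{j=1}^{m} \frac{\left(P_{O^{\mathcal{X}}}(i,j) - P_{E^{\mathcal{X}}}(i,j)\right)^2}{P_{E^{\mathcal{X}}}(i,j)} = \sum_{i=1}^{r} \sum_{j=1}^{m} \frac{\left(O_n^{\mathcal{X}}(i,j) - E_n^{\mathcal{X}}(i,j)\right)^2}{E_n^{\mathcal{X}}(i,j)}.$$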



Remark 2.2. The empirical random variable $\Delta_2(P_{O^{\mathcal{X}}}\|P_{E^{\mathcal{X}}})$
above is the chi-squared test statistic applied to a 2-dimensional contingency
table, where $O^{\mathcal{X}}(i,\cdot)$ are the observed frequencies of $X_i$ and
$E^{\mathcal{X}}(i,\cdot)$ the expected frequencies under the assumption that the
$X_i$ are independent and identically distributed.
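
The chi-squared homogeneity statistic described in Remark 2.2 can be sketched in a few lines of Python. This is a minimal illustration under the usual contingency-table convention (expected count = row total times column total divided by $n$); the function name and the toy samples are assumptions made for the example.

```python
from collections import Counter

def delta2_statistic(samples, states):
    """Chi-square homogeneity statistic sum_{i,j} (O(i,j) - E(i,j))^2 / E(i,j).

    samples : list of r samples, one per random variable X_i
    states  : iterable with the common support E
    """
    n = sum(len(s) for s in samples)
    observed = [Counter(s) for s in samples]                    # O(i, j)
    col_totals = {j: sum(o[j] for o in observed) for j in states}
    stat = 0.0
    for sample, o in zip(samples, observed):
        n_i = len(sample)                                       # row total
        for j in states:
            expected = n_i * col_totals[j] / n                  # E(i, j)
            if expected > 0:                                    # empty cells contribute 0
                stat += (o[j] - expected) ** 2 / expected
    return stat

# Two identical empirical distributions: no evidence against homogeneity.
print(delta2_statistic([[1, 1, 2, 2], [1, 1, 2, 2]], [1, 2]))
# Two completely different empirical distributions.
print(delta2_statistic([[1, 1, 1, 1], [2, 2, 2, 2]], [1, 2]))
```

Dividing the returned value by $n$ gives the $\chi^2$-divergence between the observed and expected probability functions.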



3 Derived Markov Chains 

Let $X_1^n = (X_1, \dots, X_n)$ be a sample from a multiple stationary Markov
chain $X = \{X_n\}_{n \ge 1}$ of unknown order $\kappa$. Assume that $X$ takes
values in a finite state space $E = \{1, 2, \dots, m\}$ with transition
probabilities given by

$$p(x_{\kappa+1} \mid x_1^{\kappa}) = P(X_{n+1} = x_{\kappa+1} \mid X_{n-\kappa+1}^{n} = x_1^{\kappa}) > 0, \qquad (1)$$

where $x_1^{\kappa} = (x_1, \dots, x_\kappa) \in E^{\kappa}$.



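Following Doob [13], from the process $X$ we can derive a first-order MC,
$Y^{(\kappa)} = \{Y_n^{(\kappa)}\}_{n \ge 1}$, by setting
$Y_n^{(\kappa)} = (X_n, \dots, X_{n+\kappa-1})$, so that for $v = (i_1, \dots, i_\kappa)$
and $w = (i'_1, \dots, i'_\kappa)$,

$$P(Y_{n+1}^{(\kappa)} = w \mid Y_n^{(\kappa)} = v) = \begin{cases} p(i'_\kappa \mid i_1 \dots i_\kappa), & i'_j = i_{j+1}, \; j = 1, \dots, \kappa - 1, \\ 0, & \text{otherwise.} \end{cases}$$

Clearly $Y^{(\kappa)}$ is a first-order homogeneous MC, which will be called the
derived process. By (1) it is an irreducible and positive recurrent MC
having a unique stationary distribution, say $\Pi_\kappa$. It is well known, see
[13, Chap. 5.3], that the derived Markov chains $Y^{(l)}$, $l \ge \kappa$, are
irreducible and aperiodic, and consequently ergodic.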

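There exists an equilibrium (stationary) distribution $\Pi_\kappa(\cdot)$
satisfying, for any initial distribution $\nu$ on $E^{\kappa}$,

$$\lim_{n \to \infty} \left| P_\nu(Y_n^{(\kappa)} = x_1^{\kappa}) - \Pi_\kappa(x_1^{\kappa}) \right| = 0,$$

and

$$\Pi_\kappa(x_1^{\kappa}) = \sum_{x \in E} \Pi_\kappa(x\, x_1^{\kappa-1})\, p(x_\kappa \mid x\, x_1^{\kappa-1}).$$

Likewise, for $Y^{(l)}$, $l \ge \kappa$,

$$\Pi_l(x_1^{l}) = \Pi_\kappa(x_1^{\kappa})\, p(x_{\kappa+1} \mid x_1^{\kappa}) \cdots p(x_l \mid x_{l-\kappa}^{l-1}) = \sum_{x \in E} \Pi_l(x\, x_1^{l-1})\, p(x_l \mid x\, x_1^{l-1}), \qquad (2)$$

which shows that $\Pi_l$, defined above, is a stationary distribution for
$Y^{(l)}$. For the sake of notational simplicity we shall use, from now on,

$$\Pi(x_1^{l}) = \Pi_l(x_1^{l}), \quad l \ge \kappa. \qquad (3)$$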



Now, let us shift our attention back to $X_1^n = (X_1, X_2, \dots, X_n)$ and define

$$N(x_1^{l} \mid X_1^n) = \sum_{j=1}^{n-l+1} \mathbf{1}(X_j = x_1, \dots, X_{j+l-1} = x_l), \qquad (4)$$

i.e. the number of occurrences of $x_1^l$ in $X_1^n$. If $l = 0$ we take
$N(\,\cdot \mid X_1^n) = n$. The sums are taken over positive terms
$N(x_1^{l+1} \mid X_1^n) > 0$; otherwise we adopt the convention that $0/0$ and
$0 \cdot \infty$ equal $0$.
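
The counting function in (4) can be sketched as follows. This is a minimal Python illustration; the function name `count_occurrences` and the toy sample are assumptions made for the example, and occurrences are allowed to overlap, as the sliding sum in (4) implies.

```python
def count_occurrences(word, sample):
    """N(word | sample): number of (possibly overlapping) occurrences of
    the block `word` in the sequence `sample`."""
    l = len(word)
    if l == 0:
        return len(sample)            # convention: N(. | X_1^n) = n when l = 0
    word = tuple(word)
    return sum(
        1
        for j in range(len(sample) - l + 1)
        if tuple(sample[j:j + l]) == word
    )

sample = [1, 2, 1, 2, 1]
print(count_occurrences((1, 2), sample))      # blocks starting at positions 1 and 3
print(count_occurrences((1, 2, 1), sample))   # overlapping occurrences are both counted
```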

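Now we define the empirical random variables $X_{i\alpha}$, for $i \in E$ and
$\alpha \in E^{\eta}$.

Definition 3.1. For $\alpha = (a_1, \dots, a_\eta) = a_1^{\eta} \in E^{\eta}$ and $i \in E$,
let $X_{i\alpha}$ be the random variable taking values in $E$, extracted from the
MC sample $X_1^n$, defined by

$$P(X_{i\alpha} = l) = \frac{N(i\, a_1^{\eta}\, l \mid X_1^n)}{N(i\, a_1^{\eta} \mid X_1^n)}, \quad l \in E,$$

with $\mathbf{x}_{i\alpha} = (x_1^{(i\alpha)}, \dots, x_{n_{i\alpha}}^{(i\alpha)})$ its sample of
size $n_{i\alpha}$.

As in Section 2.2, we now particularize $\mathcal{X}$ to the set
$\mathcal{X}_\alpha = \{X_{1\alpha}, \dots, X_{m\alpha}\}$ of discrete independent random
variables, aiming at the definition and application of the corresponding
$O_n^{\mathcal{X}_\alpha}(i,j)$, $E_n^{\mathcal{X}_\alpha}(i,j)$ and
$\Delta_2(O^{\mathcal{X}_\alpha}\|E^{\mathcal{X}_\alpha})$. For the sake of simplicity they
will be denoted by $O_n^{\alpha}(i,j)$, $E_n^{\alpha}(i,j)$ and
$\Delta_2(O^{\alpha}\|E^{\alpha})$, respectively.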

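Observe that for $i, j \in E$

$$O_n^{\alpha}(i,j) = N(i\, a_1^{\eta}\, j \mid X_1^n),$$

where $O_n^{\alpha}$ are the empirical random variables that describe the observed
frequencies of the $X_{i\alpha}$, $1 \le i \le m$. Likewise, we define the expected
frequencies

$$E_n^{\alpha}(i,j) = \frac{N(i\, a_1^{\eta} \mid X_1^n)\, N(a_1^{\eta}\, j \mid X_1^n)}{N(a_1^{\eta} \mid X_1^n)}$$

and the respective probability functions

$$P_{O^{\alpha}}(i,j) = \frac{O_n^{\alpha}(i,j)}{N(a_1^{\eta} \mid X_1^n)}, \qquad P_{E^{\alpha}}(i,j) = \frac{E_n^{\alpha}(i,j)}{N(a_1^{\eta} \mid X_1^n)}, \quad i, j \in E.$$

Finally,

$$\Delta_2(P_{O^{\alpha}}\|P_{E^{\alpha}}) = n \sum_{i=1}^{m} \sum_{j=1}^{m} \frac{\left(P_{O^{\alpha}}(i,j) - P_{E^{\alpha}}(i,j)\right)^2}{P_{E^{\alpha}}(i,j)} = n\, D_2(P_{O^{\alpha}}\|P_{E^{\alpha}}). \qquad (6)$$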
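
For a given context $\alpha$, the observed and expected frequencies can be computed directly from block counts of the sample. The Python sketch below is a minimal illustration under the assumption that the expected frequencies take the row-by-column form $N(i\alpha)\,N(\alpha j)/N(\alpha)$; it returns the chi-square form $\sum_{i,j}(O-E)^2/E$, and its scaling relative to (6) is not asserted here. The helper names are hypothetical.

```python
from collections import Counter

def block_counts(sample, length):
    """Counter of all contiguous blocks of the given length (length >= 1)."""
    return Counter(
        tuple(sample[j:j + length]) for j in range(len(sample) - length + 1)
    )

def context_chi2(sample, alpha, states):
    """sum_{i,j} (O(i,j) - E(i,j))^2 / E(i,j) for the context alpha, with
    O(i,j) = N(i alpha j) and E(i,j) = N(i alpha) N(alpha j) / N(alpha)."""
    alpha = tuple(alpha)
    eta = len(alpha)
    n_full = block_counts(sample, eta + 2)            # N(i alpha j)
    n_side = block_counts(sample, eta + 1)            # N(i alpha), N(alpha j)
    # N(alpha); for the empty context the convention N(.) = n is used.
    n_alpha = block_counts(sample, eta)[alpha] if eta > 0 else len(sample)
    if n_alpha == 0:
        return 0.0                                    # context never observed
    stat = 0.0
    for i in states:
        for j in states:
            expected = n_side[(i,) + alpha] * n_side[alpha + (j,)] / n_alpha
            if expected > 0:                          # 0/0 treated as 0
                observed = n_full[(i,) + alpha + (j,)]
                stat += (observed - expected) ** 2 / expected
    return stat

# Toy sample: a strictly alternating sequence, i.e. a deterministic order-1 chain.
sample = [0, 1] * 50
print(context_chi2(sample, (), [0, 1]))     # large: X_{n+1} depends on X_n
print(context_chi2(sample, (0,), [0, 1]))   # zero: nothing left to explain given X_n
```

In this toy case the statistic is large for $\eta = 0$ and vanishes for $\eta = 1$, which is the qualitative behaviour the order estimator exploits.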
Now we derive a version of the Law of the Iterated Logarithm, which is important
for establishing subsequent results about the convergence of
$\Delta_2(P_{O^{\alpha}}\|P_{E^{\alpha}})$.

Lemma 3.1. (Theorems 17.0.1 & 17.2.2) Let $X = \{X_n\}_{n \ge 0}$ be an
ergodic Markov chain with finite state space $E$ and stationary distribution
$\Pi$, $g : E \to \mathbb{R}$, $S_n(g) = \sum_{j=1}^{n} g(X_j)$ and

$$\sigma_g^2 = E_\Pi(g^2(X_1)) + 2 \sum_{j=2}^{n} E_\Pi(g(X_1) g(X_j)).$$

Then:

(a) If $\sigma_g^2 = 0$, then a.s. $\lim_{n \to \infty} \frac{1}{\sqrt{n}} \left(S_n(g) - E_\Pi(S_n(g))\right) = 0$.

(b) If $\sigma_g^2 > 0$, then a.s.

$$\limsup_{n \to \infty} \frac{S_n(g) - E_\Pi(S_n(g))}{\sqrt{2 \sigma_g^2\, n \log(\log(n))}} = 1$$

and

$$\liminf_{n \to \infty} \frac{S_n(g) - E_\Pi(S_n(g))}{\sqrt{2 \sigma_g^2\, n \log(\log(n))}} = -1.$$

($E_\Pi$: expectation with initial distribution $\Pi$; a.s.: almost surely.) ♦



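Lemma 3.2. (Lemma 2) If $Y^{(\kappa)}$ is ergodic, then for $\eta \ge \kappa - 1$,
$\alpha = (a_1^{\eta})$ and $i \alpha j = (i, a_1, \dots, a_\eta, j) = (i\, a_1^{\eta}\, j) \in E^{\eta+2}$
we have a.s.

$$\limsup_{n \to \infty} \frac{\left(N(i\, a_1^{\eta}\, j \mid X_1^n) - N(i\, a_1^{\eta} \mid X_1^n)\, p(j \mid i\, a_1^{\eta})\right)^2}{n \log(\log(n))} = 2\, \Pi(i\, a_1^{\eta}\, j)\left(1 - p(j \mid i\, a_1^{\eta})\right).$$ ♦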



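Theorem 3.3. Let us refer to (6) for the definition of
$\Delta_2(P_{O^{\alpha}}\|P_{E^{\alpha}})$, as well as to the beginning of the present
section for complementary definitions and references related to the following
result.

If $\kappa \le \eta$, there exists $C < \infty$ such that for every
$\alpha = (a_1^{\eta}) \in E^{\eta}$

$$P\left(\limsup_{n \to \infty} \frac{\Delta_2(P_{O^{\alpha}}\|P_{E^{\alpha}})}{2 \log(\log(n))} \le C\right) = 1. \qquad (7)$$

If $\eta = \kappa - 1$, there exist $a_1^{\eta}$ and $i, j, k \in E$, $k \ne i$, such that
$p(j \mid i\, a_1^{\eta}) \ne p(j \mid k\, a_1^{\eta})$, and consequently

$$P\left(\limsup_{n \to \infty} \frac{\Delta_2(P_{O^{\alpha}}\|P_{E^{\alpha}})}{2 \log(\log(n))} = \infty\right) = 1. \qquad (8)$$ ♦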

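Proof: The proof is divided into two cases.

Case I: $0 \le \kappa \le \eta$.

From ([21], Lemma 3.1) and by Definition 2.2 we can calculate

$$O_n^{\alpha}(i,j) - E_n^{\alpha}(i,j) = N(i\, a_1^{\eta}\, j \mid X_1^n) - \frac{N(i\, a_1^{\eta} \mid X_1^n)\, N(a_1^{\eta}\, j \mid X_1^n)}{N(a_1^{\eta} \mid X_1^n)},$$

or, in the limit,

$$\lim_{n \to \infty} \left(O_n^{\alpha}(i,j) - E_n^{\alpha}(i,j)\right)^2 = \lim_{n \to \infty} \left(N(i\, a_1^{\eta}\, j \mid X_1^n) - N(i\, a_1^{\eta} \mid X_1^n)\, p(j \mid i\, a_1^{\eta})\right)^2,$$

so that

$$\limsup_{n \to \infty} \frac{\left(O_n^{\alpha}(i,j) - E_n^{\alpha}(i,j)\right)^2}{n \log(\log(n))\, P_{E^{\alpha}}(i,j)} = \limsup_{n \to \infty} \frac{\left(N(i\, a_1^{\eta}\, j \mid X_1^n) - N(i\, a_1^{\eta} \mid X_1^n)\, p(j \mid i\, a_1^{\eta})\right)^2}{n \log(\log(n))\, P_{E^{\alpha}}(i,j)}.$$

Similarly,

$$\lim_{n \to \infty} P_{E^{\alpha}}(i,j) = \lim_{n \to \infty} \frac{N(i\, a_1^{\eta} \mid X_1^n)\, N(a_1^{\eta}\, j \mid X_1^n)}{N(a_1^{\eta} \mid X_1^n)\, N(a_1^{\eta} \mid X_1^n)} = \lim_{n \to \infty} \frac{\dfrac{N(i\, a_1^{\eta} \mid X_1^n)}{n}\, \dfrac{N(a_1^{\eta}\, j \mid X_1^n)}{n}}{\dfrac{N(a_1^{\eta} \mid X_1^n)}{n}\, \dfrac{N(a_1^{\eta} \mid X_1^n)}{n}} = \frac{\Pi(i\, a_1^{\eta})\, \Pi(a_1^{\eta}\, j)}{\Pi(a_1^{\eta})^2} = \theta(i,j) > 0.$$

By (1) and Lemma 3.2 we have that $\min_{i,j} \theta(i,j) > 0$ and, with

$$C = \frac{1}{\min_{i,j} \theta(i,j)} \sum_{i=1}^{m} \sum_{j=1}^{m} \Pi(i\, a_1^{\eta}\, j)\left(1 - p(j \mid i\, a_1^{\eta})\right) < \infty,$$

$$P\left(\limsup_{n \to \infty} \frac{\Delta_2(P_{O^{\alpha}}\|P_{E^{\alpha}})}{2 \log(\log(n))} \le C\right) = 1.$$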

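Case II: $\eta = \kappa - 1$.

In accordance with the limits

$$\lim_{n \to \infty} \frac{N(a_1^{\eta} \mid X_1^n)}{n} = \sum_{a \in E} \Pi(a\, a_1^{\eta}), \qquad \lim_{n \to \infty} \frac{N(a_1^{\eta}\, j \mid X_1^n)}{n} = \Pi(a_1^{\eta}\, j), \qquad \lim_{n \to \infty} \frac{N(i\, a_1^{\eta} \mid X_1^n)}{n} = \Pi(i\, a_1^{\eta}) \quad \text{a.s.},$$

we can obtain, as in the previous case,

$$\lim_{n \to \infty} P_{E^{\alpha}}(i,j) = \lim_{n \to \infty} \frac{E_n^{\alpha}(i,j)}{N(a_1^{\eta} \mid X_1^n)} = \lim_{n \to \infty} \frac{N(i\, a_1^{\eta} \mid X_1^n)\, N(a_1^{\eta}\, j \mid X_1^n)}{N(a_1^{\eta} \mid X_1^n)\, N(a_1^{\eta} \mid X_1^n)} = \frac{\Pi(i\, a_1^{\eta})\, \Pi(a_1^{\eta}\, j)}{\sum_{a \in E} \Pi(a\, a_1^{\eta}) \sum_{a \in E} \Pi(a\, a_1^{\eta})}$$

and

$$\lim_{n \to \infty} P_{O^{\alpha}}(i,j) = \lim_{n \to \infty} \frac{O_n^{\alpha}(i,j)}{N(a_1^{\eta} \mid X_1^n)} = \lim_{n \to \infty} \frac{N(i\, a_1^{\eta}\, j \mid X_1^n)}{N(a_1^{\eta} \mid X_1^n)} = \frac{\Pi(i\, a_1^{\eta}\, j)}{\sum_{a \in E} \Pi(a\, a_1^{\eta})}.$$

Clearly, if $\eta = \kappa - 1$, there exist $\alpha = a_1^{\eta}$ and $i, j \in E$ such that

$$\lim_{n \to \infty} \left(P_{O^{\alpha}}(i,j) - P_{E^{\alpha}}(i,j)\right) \ne 0,$$

since otherwise it would imply that

$$p(j \mid i\, a_1^{\eta}) = \frac{\Pi(a_1^{\eta}\, j)}{\sum_{a \in E} \Pi(a\, a_1^{\eta})},$$

i.e. $p(j \mid i\, a_1^{\eta})$ does not depend on $i \in E$, contradicting the assumption
that the order $\kappa > \eta$. Hence $\Delta_2(P_{O^{\alpha}}\|P_{E^{\alpha}})$ grows
linearly in $n$, and (8) is proved.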

3.1 Local and Global Dependency Level 

In what follows we define the Local Dependency Level and the Global
Dependency Level.

Definition 3.2. Let $X_1^n = \{X_i\}_{i=1}^{n}$ be a sample of a Markov chain $X$ of
order $\kappa$, $\kappa \ge 0$, and let $\Delta_2(P_{O^{\alpha}}\|P_{E^{\alpha}})$,
$\alpha = (a_1^{\eta})$, $\eta \ge 0$, be as in Definition 2.2.
Let us assume that $V$ is an exponential random variable with parameter $\lambda$,

$$V(x) = P(V > x) = e^{-\lambda x}.$$

We define the Local Dependency Level $LDL_n(a_1^{\eta})$, for $\alpha = a_1^{\eta}$, as


and the Global Dependency Level $GDL_n(\eta)$ as


♦

Observe that, if the hypothesis $H_0$ is true, then $\forall \alpha$

$\eta \ge \kappa$, (9)

and for $\eta = \kappa - 1$

(10)
By (9) and (10) it is clear that, for $n$ sufficiently large,

$$P\left(GDL_n(\eta) \approx 0\right) = 1, \quad \eta = \kappa - 1,$$

and

$$P\left(GDL_n(\eta) \approx V(C)\right) = 1, \quad \eta \ge \kappa,$$

and consequently, for a multiple stationary Markov chain $X = \{X_n\}_{n \ge 1}$ of
order $\kappa$,

$$\kappa = 0 \iff \lim_{n \to \infty} GDL_n(\eta) = V(C), \quad \eta = 0, 1, \dots, B,$$

$$\kappa = \max\left\{\eta : \lim_{n \to \infty} GDL_n(\eta) = 0\right\} + 1.$$



Finally, let us define the Markov chain order estimator based on the
information contained in the vector $GDL_n$.

Definition 3.3. Given a fixed number $0 < B \in \mathbb{N}$, let us define the set
$S = \{0, 1\}^{B+1}$ and the application $T : S \to \{-1\} \cup \mathbb{N}$,

$$T(s) = -1 \iff s_i = 1, \quad i = 0, 1, \dots, B,$$

$$T(s) = \max_{0 \le i \le B} \{i : s_i = 0, \; s_{i+1} = V(C)\}, \quad s = (s_0, s_1, \dots, s_B).$$ ♦

Definition 3.4. Let $X_1^n = \{X_i\}_{i=1}^{n}$ be a sample of the Markov chain $X$ of
order $\kappa$, $0 \le \kappa \le B \in \mathbb{N}$, and $\{GDL_n(i)\}_{i=0}^{B}$ as above.
We define the order estimator $\hat{\kappa}_{GDL}(X_1^n)$ as

$$\hat{\kappa}_{GDL}(X_1^n) = T(\sigma_n) + 1,$$

with $\sigma_n \in S$ such that, $\forall s \in S$,

$$\sum_{i=0}^{B} \left(GDL_n(i) - \sigma_n(i)\right)^2 \le \sum_{i=0}^{B} \left(GDL_n(i) - s(i)\right)^2.$$ ♦
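
The selection rule in Definitions 3.3 and 3.4 can be sketched in Python once a vector of GDL values is available. This is a minimal illustration under two labelled assumptions: the condition on $s_{i+1}$ in Definition 3.3 is read as $s_{i+1} = 1$ (the only non-zero value in $S$), and the index $i$ ranges over positions that have a successor. The input vectors below are made-up numbers, not simulation output.

```python
def nearest_binary_vector(gdl):
    """Closest point of {0,1}^(B+1) to gdl in squared Euclidean distance.
    The minimisation separates by coordinate, so it reduces to rounding
    (ties at 0.5 are sent to 1, an arbitrary choice)."""
    return [1 if g >= 0.5 else 0 for g in gdl]

def T(s):
    """max{i : s_i = 0, s_{i+1} = 1}, or -1 when no such index exists
    (in particular when every s_i equals 1). Assumption: s_{i+1} = 1."""
    hits = [i for i in range(len(s) - 1) if s[i] == 0 and s[i + 1] == 1]
    return max(hits) if hits else -1

def order_estimate(gdl):
    """Order estimate T(sigma_n) + 1 from the vector (GDL_n(0), ..., GDL_n(B))."""
    return T(nearest_binary_vector(gdl)) + 1

# Hypothetical GDL vectors, for illustration only.
print(order_estimate([0.90, 0.95, 0.80, 0.99]))  # no level near 0
print(order_estimate([0.10, 0.05, 0.90, 0.97]))  # last level near 0 is eta = 1
```

Rounding is used instead of an exhaustive search over $S$ because the squared distance is a sum of independent per-coordinate terms.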
By (10) and the preceding definitions it is clear that, for $n$ large enough,
$\{GDL_n(i)\}_{i=0}^{B}$ satisfies the required hypothesis; therefore, the order
estimator converges almost surely to the true value, i.e.,

$$P\left(\lim_{n \to \infty} \hat{\kappa}_{GDL}(X_1^n) = \kappa\right) = 1, \quad \kappa = 0, 1, 2, \dots, B.$$



4 Numerical Simulations 

Clearly, for a given $\eta$, the random variables $LDL_n(a_1^{\eta})$ and
$GDL_n(\eta)$ contain much of the information concerning the sample's relative
dependency. Nevertheless, numerical simulations as well as theoretical
considerations anticipate a great deal of variability for small samples.

The numerical simulation starts with the creation, based on an algorithm
due to Raftery [21], of a Markov chain transition matrix
$Q = (q_{i_1 i_2 \dots i_\kappa; i_{\kappa+1}})$ with entries

$$q_{i_1 i_2 \dots i_\kappa; i_{\kappa+1}} = \sum_{t=1}^{\kappa} \lambda_t R(i_{\kappa+1}, i_t), \quad 1 \le i_t \le m,$$

where the matrix

$$R(i,j), \; 1 \le i, j \le m, \qquad \sum_{i=1}^{m} R(i,j) = 1, \; 1 \le j \le m,$$

and the positive numbers

$$\{\lambda_i\}_{i=1}^{\kappa}, \qquad \sum_{i=1}^{\kappa} \lambda_i = 1,$$

were arbitrarily chosen in advance.

Once the matrix $R$ and the set of numbers $\{\lambda_i\}_{i=1}^{\kappa}$ are selected,
a Markov chain sample of size $n$, state space $E$ and transition matrix $Q$ is
generated. It is quite intuitive that the random information about the order
of a Markov chain is spread over an exponentially growing set of multinomial
empirical distributions of cardinality $m^{B+2}$, where $B$ is the maximum value
of the integer $k$ in $\alpha = (i_1 i_2 \dots i_k)$. It seems reasonable to think
that a small viable sample, i.e. a sample able to retrieve enough information
to estimate the chain order, should have size $n \approx O(m^{B+2})$.
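
The construction of $Q$ from $R$ and the weights $\lambda_t$, and the generation of a sample, can be sketched as follows. This is a minimal Python illustration (the simulations reported in the text were run in R); states are coded $0, \dots, m-1$ rather than $1, \dots, m$, the initial block is drawn uniformly, and the function names are assumptions made for the example.

```python
import itertools
import random

def mtd_transition(R, lambdas):
    """Transition probabilities q(i_1..i_k ; j) = sum_t lambdas[t] * R[j][i_t].

    R       : m x m matrix with columns summing to 1 (R[j][i], next state j)
    lambdas : k positive weights summing to 1 (k >= 1)
    Returns a dict mapping each context (i_1, ..., i_k) to a list over j.
    """
    m, k = len(R), len(lambdas)
    Q = {}
    for ctx in itertools.product(range(m), repeat=k):
        Q[ctx] = [
            sum(lam * R[j][i_t] for lam, i_t in zip(lambdas, ctx))
            for j in range(m)
        ]
    return Q

def sample_chain(Q, k, m, n, rng):
    """Generate n observations of the order-k chain with transitions Q (k >= 1).
    Assumption: the first k states are drawn uniformly at random."""
    x = [rng.randrange(m) for _ in range(k)]
    while len(x) < n:
        ctx = tuple(x[-k:])
        x.append(rng.choices(range(m), weights=Q[ctx])[0])
    return x

# The first 3 x 3 matrix of Table 1 with lambda_i = 1/2, i = 1, 2 (order 2).
R = [[0.05, 0.05, 0.90],
     [0.05, 0.90, 0.05],
     [0.90, 0.05, 0.05]]
Q = mtd_transition(R, [0.5, 0.5])
chain = sample_chain(Q, k=2, m=3, n=1000, rng=random.Random(0))
print(Q[(0, 0)], chain[:10])
```

Because every column of $R$ sums to one and the weights sum to one, each row of $Q$ is a probability vector.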
Table 1: Markov Chain Examples with |E| = 3.

Q_1 =  0.05 0.05 0.90        Q_2 =  0.05 0.05 0.90
       0.05 0.90 0.05               0.05 0.05 0.90
       0.90 0.05 0.05               0.05 0.05 0.90

|E| = 3, n = 1000

        κ = 2                         κ = 3                         κ = 0
        Q_1, λ_i = 1/2, i = 1, 2      Q_1, λ_i = 1/3, i = 1, ..., 3 Q_2, λ_i = 1/3, i = 1, ..., 3
  k     AIC    BIC    EDC    GDL      AIC    BIC    EDC    GDL      AIC    BIC    EDC    GDL
  0     -      -      -      -        -      -      -      -        43%    100%   100%   98%
  1     -      -      -      1.5%     -      -      -      -        53%    -      -      2%
  2     100%   100%   100%   98.5%    -      99.5%  84.5%  40%      4%     -      -      1%
  3     -      -      -      -        100%   0.5%   15.5%  60%      -      -      -      -
  4     -      -      -      -        -      -      -      -        -      -      -      -
  5     -      -      -      -        -      -      -      -        -      -      -      -
Once the sample is obtained, the order estimators GDL, AIC and BIC
are calculated from the same Markov chain sample and the numerical
results are recorded in tables. To conclude, we would like to point
out that all the numerical simulations were done using the
remarkable free software R [20].
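
For reference, the AIC and BIC order estimators used as benchmarks can be sketched in their standard penalised-likelihood forms, $-2\log\hat{L}_k + 2\,m^k(m-1)$ and $-2\log\hat{L}_k + m^k(m-1)\log n$. The exact variants used in the simulations are not stated in the text, so the Python code below is an illustrative assumption; in particular, all orders are fitted on the transitions that follow the first `max_order` observations so that the likelihoods are comparable.

```python
import math
from collections import Counter

def log_likelihood(sample, k, start):
    """Maximised log-likelihood of an order-k chain, using the transitions
    into positions start, ..., len(sample) - 1 (requires start >= k)."""
    trans = Counter(
        tuple(sample[j - k:j + 1]) for j in range(start, len(sample))
    )
    ctx = Counter()
    for block, c in trans.items():
        ctx[block[:k]] += c                       # N(context .)
    return sum(c * math.log(c / ctx[block[:k]]) for block, c in trans.items())

def aic_bic_orders(sample, m, max_order):
    """Return (AIC order, BIC order) over k = 0, ..., max_order for m states."""
    n = len(sample)
    aic, bic = {}, {}
    for k in range(max_order + 1):
        ll = log_likelihood(sample, k, max_order)
        params = (m ** k) * (m - 1)               # free transition parameters
        aic[k] = -2.0 * ll + 2.0 * params
        bic[k] = -2.0 * ll + params * math.log(n)
    return min(aic, key=aic.get), min(bic, key=bic.get)

# Toy check: a strictly alternating sequence is a deterministic order-1 chain.
print(aic_bic_orders([0, 1] * 100, m=2, max_order=3))
```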



5 Conclusion 

The pioneering research started with the contributions of Bartlett, Hoel [16],
Good [15], Anderson & Goodman [1], Billingsley ([7], [8]), among others, who
developed hypothesis tests for detecting the order of a Markov
chain.

Later on, these tools were adapted and improved with the use of penalty
functions (Tong [23], Katz [17]), together with other tools created in the realm
of model selection (Akaike [1], Schwarz [22]). Since then, there has been
a considerable number of subsequent contributions on this subject, several
of them consisting of enhancements of the already existing techniques
(Csiszár [11], Zhao et al. [23]).

Table 2: Markov Chain Examples with |E| = 4.

0.05 0.85
0.05 0.85
0.05 0.85
0.05 0.85

|E| = 4, n = 5000

        κ = 2                         κ = 3                         κ = 0
        Q_3, λ_i = 1/2, i = 1, 2      Q_3, λ_i = 1/3, i = 1, ..., 3 Q_4, λ_i = 1/3, i = 1, ..., 3
  k     AIC    BIC    EDC    GDL      AIC    BIC    EDC    GDL      AIC    BIC    EDC    GDL
  0     -      -      -      -        -      -      -      -        85%    100%   100%   100%
  1     -      -      -      -        -      -      -      -        15%    -      -      -
  2     100%   100%   100%   100%     -      100%   -      5.5%     -      -      -      -
  3     -      -      -      -        100%   0.5%   100%   94.5%    -      -      -      -
  4     -      -      -      -        -      -      -      -        -      -      -      -
  5     -      -      -      -        -      -      -      -        -      -      -      -

In these notes we propose a new Markov chain order estimator based on a
different idea, which makes it behave quite differently. This estimator is
strongly consistent and more efficient than AIC (which is inconsistent), and
it outperforms the well-established and consistent BIC, mainly on relatively
small samples.